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We investigate the rate-length scaling law of protein folding, a key undetermined scaling law in 
the analytical theory of protein folding. We demonstrate that chain length is a dominant factor 
determining folding times, and that the unambiguous determination of the way chain length corre- 
lates with folding times could provide key mechanistic insight into the folding process. Four specific 
proposed laws (power law, exponential, and two stretched exponentials) are tested against one an- 
other, and it is found that the power law best explains the data. At the same time, the fit power 
law results in rates that are very fast, nearly unreasonably so in a biological context. We show that 
any of the proposed forms are viable, conclude that more data is necessary to unequivocally infer 
the rate-length law, and that such data could be obtained through a small number of protein folding 
experiments on large protein domains. 



INTRODUCTION 

A deep understanding of protein folding must involve 
a description of the general mechanisms involved. It 
is reasonable to suspect this will consist of a simple 
model, based on microscopic physics, expressed in a sim- 
ple mathematical language. Such a model would show 
how biological sequences are able to employ physics to 
spontaneously self-assemble into intricate molecular ma- 
chines. 

Simple models of this sort often start by postulating a 
mechanism of folding, and then derive the consequences 
of that mechanism [1, 2, 9-18]. This suggests it might be 
possible to infer the general mechanisms of protein fold- 
ing by verifying the specific predictions of these models. 
For a simple model of protein folding, however, there is 
a limited set of general experimental trends that can be 
readily predicted. One such experimental trend is the law 
governing how folding times scale with chain length. This 
is perhaps the simplest comparison of theory and exper- 
iment possible, but has not yet been unambiguously in- 
ferred despite nearly two decades of active research [1, 2]. 

The rate-length law is also an interesting result in and 
of itself. Such a law can be viewed as a statement of 
the computational complexity of protein folding - given 
a problem of size N (residues), how does one expect the 
time-to-solution (folding) to scale? Levinthal pointed out 
that an exhaustive search would result in exponential 
scaling, and suggested that this would result in unreason- 
ably large folding times [3] . Thus, in many ways, a resolu- 
tion to Lcvinthal's paradox is likely to be phrased directly 
as a rate scaling law, either non-exponential (polynomial) 
or exponential with an explicitly small exponential fac- 
tor. 

A number of issues complicate inferring such a law 
from experiment, most importantly the fact that avail- 
able kinetic data on protein folding spans a very limited 
range of chain lengths - about 30 to 300 residues [4] . The 
statistical power of the data is inherently limited by the 
fact that protein domain sizes barely span a single or- 



der of magnitude, and that most studies of folding have 
focused on small, well-behaved model systems. 

Further complicating the inference of the rate-length 
scaling law is the fact that chain length is certainly not 
the only factor affecting folding rates. In fact, it has been 
argued that it is a fairly weak predictor of the folding time 
[5, 6]. For instance, there seems to be some correlation 
of folding times with topological complexity of the native 
state, such that if two proteins have the same number 
of residues, but different folds, they may take different 
amounts of time to fold [5, 7]. Moreover, even protein 
mutants with the same native structure can have at least 
3 orders of magnitude variation in their folding rates [8] . 
Thus, we expect that experimental data on the scaling of 
folding time with chain length should be very noisy, and 
difficult to statistically estimate. 

Nonetheless, there does seem to be a significant corre- 
lation between the number of residues (N) and folding 
times (r). Many different forms of the scaling law have 
been predicted, but all fall into one of three basic classes. 
Shaknovich [2], Cieplak [9, 10], and co-workers have pro- 
posed a power-law, r ~ N v . We recently constructed a 
model that suggested exponential scaling, r <~ e aN [11], 
consistent with predictions made by Zwanzig [12, 13]. Fi- 
nally, Thirumalai [1, 14], Munoz [15], Takada [16], Finkel- 
stein [17], and co-workers have suggested a stretched ex- 
ponentials, t <~ e aN , with /? as 1/2 or 2/3. Wolynes has 
proposed the law may conditionally change between all 
four suggested models [18]. 

In what follows, we develop and apply two methods 
for choosing between these models and evaluate how each 
proposed model performs. 



MODELING 

Below, we outline two complementary methods for in- 
ferring which proposed scaling law is the most reasonable. 
First, we present a method capable of fitting each model's 
parameters to the known data, and examining how well 
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each model explains the data. Next, we investigate a sec- 
ond discriminatory method, which proposes that folding 
times must be below a certain threshold value to be bio- 
logically viable. It has been demonstrated that in the 
crowded milieu of the cell, proteins must fold rapidly 
to avoid aggregation or degradation. We suggest that 
this implies that we can check the reasonableness of any 
model by seeing if its prediction for this threshold time 
is reasonable with what has been observed empirically in 
biology. 

In this study we focus on single-domain globular pro- 
teins. Kinetic data for the folding times of proteins were 
taken from the KincticDB [4] , which reports protein fold- 
ing times at zero denaturant, near room temperature, 
and under neutral pH. Other data sets exist [19-21], but 
were not consistent with one another - despite this, they 
yielded very similar results (see SI, Fig. 2 and Table I). 

Direct Method: Likelihood Maximization 

We want to estimate the parameters for each proposed 
form of the scaling law. In what follows, we adopt a 
model that accounts not only for this scaling law, but 
all other factors (topology, experimental conditions, etc.) 
via a random Gaussian component. Thus, by fitting each 
model, we not only learn parameters for each proposed 
model, but also get an estimate for the relative impor- 
tance of these other factors in determining folding times. 

We assert the following model for the folding time, 



where 



/(AO 



logr/T = f(N)+X 



v log N power-law 

aN exponential 

aN 1 ' 2 stretched exp. (1/2) 
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are the proposed folding rate laws, X represents a ran- 
dom variable distributed as a zero-mean Gaussian, X ~ 
Af(0,a 2 ), and To is a fit constant accounting for units of 
time. By adding X to the logarithm of the folding time 
(1), we model random variation in relative terms, and it 
enters as a multiplicative factor. 

Equation (1) implies r is distributed as a log- normal, 
with location parameter f(N) and scale parameter a. 
The likelihood of the entire data set (assuming n inde- 
pendent measurements) is 
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We have three parameters for each model, a, t , and a 
or v for the exponential and power-law families, respec- 
tively. 



We have fit these parameters by maximizing the like- 
lihood C. Model comparison can then be performed by 
investigating the ratio of the likelihoods of two alterna- 
tive models (Table I). We have adopted the simple like- 
lihood approach (versus a full-fledged Bayesian analy- 
sis) because the number of fit parameters are small and 
equal for each model, the models are simple and low- 
dimensional, and we have little prior information about 
the parameters. See the supplemental information, Fig. 4 
and Table II for a Bayesian analysis and comparison. 



Indirect Method: Biological Limits 

We postulate that there exists a critical time, r c , that 
places a biological upper bound on folding times. Specifi- 
cally, if a protein folds slower than this time (i.e. r > r c ) 
then that protein will be much more likely to aggregate 
during the course of folding, and therefore is evolution- 
arily selected against. 

The majority of biologically observed proteins should 
have folding times less than r c , but we postulate that 
some proteins will have greater times. These proteins are 
those that receive help folding from chaperones or other 
cellular machinery. It has been estimated that about 
C « 10% of proteins fall into this category [22]. 

Together, these assumptions allow us to build a model 
for the predicted distribution of protein chain lengths. 
The size distribution of domains (SI Fig. 1) can be 
roughly approximated by a Gaussian with parameters 
fiN and ctat. In that case, 
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where / _1 (t c /to) = N c is the chain length corresponding 
to t c for a specific model (power law, exponential, etc.), 
and C is the percentage of proteins with folding times 
slower than r c . 

This framework is, of course, an approximation. There 
are undoubtedly many other factors affecting the opti- 
mal sizes of proteins beyond merely their folding times. 
Metabolic efficiency, structural packing constraints [23, 
24] , and the behavior of specific proteins in their local cel- 
lular environments certainly play a role. Nonetheless, the 
concept of an upper limit to the folding times is reason- 
able, and our aim here is to simply extract some general 
comments about the reasonableness of predicted folding 
times, rather than make quantitatively accurate predic- 
tions. 



RESULTS 

Direct fitting of all proposed models to the available 
data yields reasonable results for each (Fig. 1). Each 
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FIG. 1. The predicted models for the folding rate law, overlaid with measurements of folding times (top), and the putative 
folding time distributions these models imply (bottom) . Parameter values derived from a maximum likelihood fit are displayed, 
along with intervals indicating the spread in the fit probability distribution. Dark grey shading indicates a factor of e" , while 
light grey indicates e 2tT . 



TABLE I. Likelihood Ratios of ^-Maximized Models 

Model a Pr. Law Exp. S. E. 1/2 S. E. 2/3 

Power Law 1.59 • 10 3 7.98 • 10° 3.82 • 10 1 

Exponential 6.30 • 1(T 4 5.03 • 10" 3 2.41 • 10~ 2 

S. E. 1/2 1.25 ■ 10 _1 1.99 • 10 2 4.79-10° 

S. E. 2/3 2.61 ■ 10~ 2 4.15 ■ 10 1 2.09 ■ 10 _1 

a Primary model is on the left, alternate model along the top - 
thus, a larger number favors the model in the leftmost column. 

model reports a scale parameter (a) of approximately 
3, which indicates that 68% of proteins will have fold- 
ing times within a factor of e CT « 20 from the time pre- 
dicted by the rate law, and 95% will be within a factor of 
e 2a « 400. Since the available data spans folding times of 
more than 9 orders of magnitude (between 1.9 • 10~ 7 and 
9.9 • 10 2 seconds), this demonstrates that chain length 
captures the majority of variation in folding times, since 
orthogonal factors (topology, mutations, etc.) account 
for approximately 400 2 ~ 10 5 orders of magnitude of 
variation. 

Do the data support any one model? The power-law 
model is slightly favored by comparing the likelihoods 
that each model generated the observed data (Table I). 
In such comparisons, typically a ratio of 10 2 or greater 
is considered significant, and often models differ by hun- 
dreds of orders of magnitude [25] - thus, the power law 



model is better supported by the data, but only by a 
modest margin. Further, an attempt to fit the stretched 
exponential form with (3 as a variable parameter resulted 
in an unreasonably small value of j3 along with a very 
large value of a, resulting in a fit that is very close to the 
power law (SI Fig. 3). Finally, the power law model has 
the smallest fit a, indicating that it explains the most 
variation in the data, and attributes less to orthogonal 
factors. 

It is clear, however, that there is little difference be- 
tween the models in the range of available data. These 
models diverge significantly only for very large proteins 
(Fig. 2) . They yield significantly different predictions for 
the distribution of folding times, generated by transform- 
ing the known distribution of domain sizes into each of 
the different models (Fig. 1). The most significant dif- 
ferences are in the tails of these distributions, where the 
exponential forms predict much longer folding times for 
the largest proteins (Table II) . The power law model pre- 
dicts no proteins fold in times longer than an hour, while 
the exponential forms show a significant number of pro- 
teins with folding times longer than a day (Fig. 2). 

An evaluation of the reasonableness of these folding 
time distributions is provided by the critical time r c for 
each model (SI Table III). For reasonable values of C, the 
power law r c is on the order of minutes. The exponen- 
tial forms, on the other hand, predict r c is on the order 
of hours. Indeed, picking a reasonable value of r c and 
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Domain Size (residues) T o (seconds) 



FIG. 2. The predicted folding times from each model in Fig- 
ure 1, in a direct comparison. Intuitive timescales are denoted 
for clarity. 



TABLE II. Estimated Fraction of Protein Domains with Fold- 
ing Times Greater than Time Indicated 





Hour 


Day 


Month 


Year 


Power Law 


0.41% 


0.01% 


0.00% 


0.00% 


Exponential 


9.56% 


5.70% 


3.34% 


2.46% 


S. E. 1/2 


5.48% 


2.37% 


0.95% 


0.57% 


S. E. 2/3 


7.49% 


3.53% 


1.74% 


1.11% 



calculating the probability of observing the empirically 
observed domain size distribution (Fig. 3) shows that for 
values of r c ~ 10 seconds, the power law model is clearly 
the best. However, for any values of t c greater than 100 
seconds, the exponential laws are much better models. 

CONCLUSIONS 

Thus, we have a mixed conclusion. While the power 
law model appears to best explain the available raw data, 
it results in very fast predicted folding times. The ex- 
ponential forms, while doing a marginally poorer job of 
explaining the raw data, yield a distribution of folding 
times much more in line with what we expect from bi- 
ology. Given the current available data, no clear victor 
emerges. 

Previous theories have claimed that simply predicting 
one of the four laws investigated here is strong evidence 
in support of that theory. This is manifestly not the case 
- not only must the proposed law be reasonable, but it 
must also predict reasonable parameter estimates, and 
even then the supporting evidence the rate scaling law 
can provide given current data is limited. Conversely, 
the analytical theories mentioned here are not ruled out 
by the current available data. This is most striking in 



FIG. 3. The of probability of observing the observed domain 
sizes (SI Fig. 1) given a rate scaling law and a folding com- 
plexity cutoff time, r c . Assumes the Gaussian model (with 
C = 0.10) described in the main text. Note the y-axis is log 10 
and divided by 10 3 for clarity, so small differences on the plot 
are actually quite large. 



the case of the exponential form, since exponential scal- 
ing of the folding times has often been associated with 
Levinthal's paradox. This study shows that exponential 
scaling is reasonable given current experimental data, so 
long as the exponential scaling constant (a) is sufficiently 
small. 

Clear evidence for any one rate law remains missing, 
however with a few clear examples of very large globular 
proteins (500 residues or larger) capable of folding unas- 
sisted in vitro, it might be possible to discriminate be- 
tween the models proposed here. Figure 2 clearly shows 
the divergences between predicted folding times for large 
proteins, and shows how just a few data points in this 
extreme regime might be able to begin differentiating be- 
tween the proposed models investigated here. 
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FIG. 1. The size distribution of non- homologous protein 
domains listed in the PDB. Of the 12,151 sequences reported, 
39 are larger than 1000 residues (0.32%) - while not shown 
here for clarity, they were included in subsequent analy- 
ses. Data were obtained from the NIH's VAST algorithm 
(http://www.ncbi.nlm.nih.gov/Structure/VAST/nrpdb.html) 
on 5/5/2012 with a dissimilarity p-value of 10 -7 . 



DIFFERENT DATASETS OF PROTEIN 
FOLDING KINETIC DATA 



We are aware of four collections of protein folding ki- 
netic data, here termed "KineticDB" [1], "Liang" [2], 
"Muhoz" [3], "Finkelstein" [4]; note Finkelstein is re- 
portedly a subset of KineticDB. Each is purported to 
be extracted directly from the primary literature. Inter- 
estingly, while there is much overlap of reported proteins 
between each dataset, there are systematic inconsisten- 
cies between them. Most report that they extrapolate 
the folding times to zero denaturant - we suspect this is 
the origin of the discrepancy, but have not undertook a 
detailed investigation. Instead, we chose the KineticDB 
dataset because it contained the most proteins, appeared 
well curated, and restricted entries to proteins at zero de- 
naturant, near room temperature, and at neutral pH. 

Interestingly, despite their discrepancies, the datasets 
appear to be similar on a coarse level (Fig. 2). Further, 
our fitting analysis run on each independently results in 
parameters that are not too far from one another (Table 
I), and the rest of our results (which primarily follow from 
these parameters) are very similar between datasets. 

For this study, we simply extracted all non-mutant en- 
tries from the KineticDB and employed their reported 



FIG. 2. The rate-length data from each source used, plotted 
together. Even though there is some overlap in reported se- 
quences, many identical proteins have different chain lengths 
or folding times reported. We combined these data for our 
analysis, eliminating only identical measurements. 



TABLE I. Parameters Estimated from Different Datasets 





Pr. 


Law 


Expon. 


S. E 


1/2 


S. E 


2/3 


Dataset 


V 


a 


a 


a 


Q 


a 


a 


a 


KineticDB 


5.44 


3.06 


0.046 


3.32 


1.11 


3.13 


0.37 


3.19 


KDB (mut) a 


4.46 


2.42 


0.040 


2.53 


0.91 


2.45 


0.31 


2.47 


Liang 


4.75 


2.54 


0.041 


2.90 


0.94 


2.70 


0.31 


2.77 


Munoz 


4.79 


3.05 


0.054 


3.14 


1.04 


3.09 


0.37 


3.11 


Finkelstein 


5.24 


3.08 


0.048 


3.32 


1.08 


3.17 


0.36 


3.22 



The KineticDB dataset with mutants included. Since there are 
only a few proteins with mutants, and there are many mutants 
for these few proteins, this database gives artificially more 
weight to those individual proteins. 



chain lengths and folding times in water. 



STRETCHED EXPONENTIAL WITH /3 AS A 
FREE PARAMETER 



In the main text we investigated the specific stretched 
exponential forms proposed by extant simple models (i.e. 
with j3 as 1/2 and 2/3). There is no reason why j3 cannot 
be simply fit as a free parameter, however, and we have 
performed this fit (Fig. 3). Remarkably, the most likely 
parameters are those with a ~ 10 2 and j3 sa 1CP 2 , which 
is far outside the range predicted by theory. With these 
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FIG. 3. The parameter fit for a stretched exponential, logr oc 
aN 13 , with /3 a free parameter. Notice how the fit is perfectly 
straight on a log-log plot, a characteristic trait of power laws. 



FIG. 4. The parameter posteriors for each model are sharply 
peaked around their modes - plotted here is the maximum 
(not the marginal value) of the posterior at various values of 
the key parameters a or v. One can see that the likelihood 
has a sharply peaked value along this dimension. 



extreme parameters, the model appears to be a power law 
form over the range of fit data, and is linear on a log-log 
plot to about 10 25 residues. This simply verifies that the 
data are best explained by a power law, and demonstrates 
that the power law fit with free j3 is sufficiently flexible 
to capture this fact. 

BAYESIAN MODEL COMPARISON 



TABLE II. Bayes Factors Comparing Datasets 

Pr. Law Exp. S. E. 1/2 S. E. 2/3 

Power Law 1.32 • 10 5 3.55 • 10 1 4.74 • 10 2 

Exponential 7.55 • 1(T 6 2.68 ■ 1(T 4 4.74 • 10 2 

S. E. 1/2 2.81 ■ 10" 2 3.73 • 10 3 1.33 ■ 10 1 

S. E. 2/3 2.11 • 1(T 3 2.80 • 10 2 7.50 • 1(T 2 



As mentioned in the main text, it is entirely possible to 
pursue a full-fledged Bayesian analysis to determine the 
parameters for each model. Here, we show the common- 
alities and differences between the maximum likelihood 
approach pursued in the main text and a Bayesian ap- 
proach. We show that it makes little difference which is 
chosen. 

We have little a priori information about the parame- 
ters we are to fit besides the fact that each (a, v, a) must 
be greater than zero. Thus, we choose a uniform prior for 
each over the interval [0, oo], and write this distribution 
ir (6), where 9 will stand in for the vector of parameters 
relevant to a particular model. Then we can write the 
Bayesian posterior as 

P{6\D) oc C{D\e)ir(fi) 

with C the likelihood from the main text. Since tt is 
uniform, however, we can write 

P(0\D) oc C(D\0) 

as long as all the parameters are in the interval [0, oo]. 
Given the posterior P(9\D), it is common to simply take 
the mode as the best representative set of parameters - 
which are the maximum likelihood parameters used in 



the main manuscript. This choice is justified because we 
have found the posterior to be strongly peaked around 
the mode, as seen in Figure 4. 

The only other difference between a Bayesian analysis 
and the likelihood methods are the way models are com- 
pared. In contract to a likelihood ratio, Bayesian statis- 
tics recommends a Bayes' factor J- [5], which compares 
two models (Mi, M 2 ) 

= J e £ Ml (D\d)n Ml (e) 
12 J C M2 (D\9)7r M2 (0) 

which explicitly includes information from all possible 
values of the parameters. The likelihood ratios used in 
the main text are a saddle approximation to the integrals, 
and thus these two methods match in the case where the 
posteriors are highly peaked. Table II shows the calcu- 
lated Bayes' factors for each model. These Bayes' factors 
were calculated by also integrating over r , the param- 
eter accounting for a constant offset (or units) in our 
fits. The domain of integration for tq was restricted to 
[—50, 10] for numerical stability - increasing it beyond 
this range resulted in little difference. 
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TABLE III. Most Likely r c for Values of C (in seconds) 



C 



0.1 



0.05 



0.01 



0.001 



Power Law 


1.10 


10 1 


2.88 


10 1 


1.23 


10 2 


4.36 


10 2 


Exponential 


3.03 


10 2 


5.85 


10 3 


1.51 


10 6 


7.64 


10 8 


S. E. 1/2 


9.99 


10 1 


6.54 


10 2 


1.53 


10 4 


3.47 


10 5 


S. E. 2/3 


1.68 


10 2 


1.57 


10 3 


7.67 


10 4 


4.25 


10 6 



DETAILS OF THE GAUSSIAN FIT MODEL 

Here we go over some of the details of the Gaussian 
model that was used to predict the folding time distribu- 
tion. In the main text, we presented the central equation 



/-i(r c /ro) y/2wa 2 N 



exp 



2< 



C 



from which all of our subsequent analysis begins. Using 
this model, we can predict r c for a chosen model and 
chosen C. To do this, was fixed at 105, the empirical 
mode of the distribution of domain sizes (Fig. 1) - this 
appeared to make subsequent fits more robust. The pa- 
rameter er/v was then fit, via likelihood maximization, to 
the same empirical distribution. Subsequently, a series 



of values of C were chosen, and these implied values of 
t c for each model, resulting in Table III. 



It is also possible to calculate the likelihood of observ- 
ing the known empirical data given a specific r c , via likeli- 
hood maximization. The procedure here was straightfor- 
ward; first, C was set to 0.10, [In was again fixed at 105, 
r c was fixed at an arbitrary value for a chosen model, and 
(Tat was varied to maximize the likelihood of witnessing 
the empirical distribution given that model. The results 
for a scan of r c for each model is reported in Figure 3 in 
the main text. 
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